Fix PPL increase caused by mmq_id #913
Merged
@Nexesenex noticed that the `mmq_id` MoE matrix multiplication approach added in #728 leads to a non-negligible increase in perplexity (#728 (comment)). As PR #728 is derived from PR 15525 in mainline `llama.cpp`, the same issue exists there (see also #728 (comment)). As a result, I added the ability to disable `mmq_id` via a command line parameter in #910.

I don't like the #910 solution because disabling `mmq_id` leads to significantly lower PP performance for small batch sizes. So, after some back-and-forth with @JohannesGaessler, I went ahead and changed the `mmq_id` implementation to pretend that the number of streaming multiprocessors (SMs) is a power of 2. This does in fact fix the PPL increase for all models I tried (GLM-4.5-AIR, Ling-mini-2.0, Qwen3-30B-A3B, DeepSeek-Lite). Based on experimentation with these models, it seems that using the lowest power of two that is >= the number of SMs gives better performance than using the highest power of two that is <= the number of SMs.

I don't see any negative impact on performance. If anything, PP performance on my RTX 4080 GPU is very slightly better.
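A minimal sketch of the rounding described above; the helper name and its use are illustrative, not the actual `ik_llama.cpp` code:

```cpp
#include <cstdint>
#include <cstdio>

// Lowest power of two that is >= n. A hypothetical helper that only
// illustrates the "pretend the SM count is a power of 2" idea.
static uint32_t round_up_pow2(uint32_t n) {
    uint32_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

int main() {
    // E.g. an RTX 4080 reports 76 SMs (cudaDeviceProp::multiProcessorCount),
    // so the mmq_id work partitioning would pretend there are 128.
    const uint32_t nsm = 76;
    std::printf("nsm = %u -> padded = %u\n", nsm, round_up_pow2(nsm));
    return 0;
}
```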
Please try it and report any performance degradation (or improvement). To make sure that you are using `mmq_id` rather than the original `ik_llama.cpp` MoE matrix multiplication implementation, add `-cuda mmq-id-size=1000` to your command line.
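For example (the binary and model path are illustrative; only the `-cuda mmq-id-size=1000` flag is the relevant part):

```
./bin/llama-sweep-bench -m /path/to/model.gguf -cuda mmq-id-size=1000
```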